RzShell: refactor string, regex and byte search #4762

Rot127 · 2024-12-10T19:06:13Z

Your checklist for this pull request

I've read the guidelines for contributing to this repository
I made sure to follow the project's coding style
I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
I've added tests that prove my fix is effective or that my feature works (if possible)
I've updated the rizin book with the relevant information (if needed)

Detailed description

Changes made

Move all legacy search commands to RzShell (the commands only; internally they still do their own string parsing of the arguments).
Refactor string and byte search
Move to RzShell
Move / to /z.
Add support for Unicode and EBCDIC string search.
Add support for (Unicode) regex string search.
Add support for byte string regex search (/xr).
Add more details to the search help messages.
Offsets of the search hits align with the actual encoding, not with the UTF-8 encoding.
Dispatch memory chunks for search into threads.
Changes to ps
- Add extra arguments to specify the encoding (including EBCDIC).
- Add an additional delimiter argument (stop at the first non-printable character).
- Document it more.
- Add psu alias for ps utf8.
Changes to Settings
- Remove str.search.max_uni_blocks - Effectively a metric the user should not know about; adds too much complexity.
- str.search.max_threads -> search.max_threads - This is a general setting for the search now.
- str.search.raw_alignment -> search.str.raw_alignment - Unify settings (only used for RzBin search).
- str.search.encoding -> str.encoding - Valid for all string interpretations.
- str.search.min_length -> search.str.min_length - Unify settings.
- str.search.buffer_size -> search.str.max_length - Unify settings.
- str.search.max_region_size -> search.str.max_region_size - Unify settings.
- str.search.check_ascii_freq -> search.str.check_ascii_freq - Unify settings.
Removed commands
- /! - Because the command modifiers are not properly handled in RzShell yet and the advantage of this one is dubious (IMHO).
- /f - Modifier and obsolete, because search is dispatched into threads.
- /b - Modifier and obsolete, because search is dispatched into threads.
- /+ - Because it is unclear what it does. It does not seem particularly useful.
- /e - Replaced with regex search in bytes and string search.
- /w - All Unicode is searched now properly with /z.
Make some changes to the string escaping, so it works reliably with Unicode characters.
- The RzStrEscOptions were used inconsistently.
  E.g. show_asciidot (replace non-printable ASCII with a dot) was ignored for \n, \t etc.
- Defined Unicode code points are escaped now with \U00hhhhhh. All other non-printable bytes are escaped with \xhh. There are still some exceptions (when legacy escape functions are used) but most places are ok now.
General
- Fix inconsistencies in Unicode decoders/encoders and checkers. They now return either 0 on an invalid decode or the number of bytes the code point requires.
- Add many unit tests for Unicode related logic.
- Update Unicode tables to Version 16.
- Add helper to check code points.
- Escaped strings now escape valid Unicode code points to \Uhhhhhh (if not requested otherwise by the user) and invalid code points to \xhh.
- Add helper functions for hexadecimal strings and bits.
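
To illustrate the decoder contract and the escaping rule described above, here is a minimal sketch in C. It is not the rizin implementation: the names sketch_utf8_decode and sketch_escape are hypothetical, the lowercase hex digits are an assumption, and passing printable ASCII through unchanged (while escaping everything else, including \n and \t) is a simplification made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoder, not the rizin API. Returns the number of bytes the
 * code point occupies, or 0 if buf does not start with valid UTF-8. */
size_t sketch_utf8_decode(const uint8_t *buf, size_t len, uint32_t *cp) {
	assert(buf != NULL && cp != NULL);
	if (len == 0) {
		return 0;
	}
	uint8_t b0 = buf[0];
	size_t need;
	uint32_t value, min;
	if (b0 < 0x80) {
		*cp = b0;
		return 1;
	} else if ((b0 & 0xe0) == 0xc0) {
		need = 2; value = b0 & 0x1f; min = 0x80;
	} else if ((b0 & 0xf0) == 0xe0) {
		need = 3; value = b0 & 0x0f; min = 0x800;
	} else if ((b0 & 0xf8) == 0xf0) {
		need = 4; value = b0 & 0x07; min = 0x10000;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}
	if (len < need) {
		return 0; /* truncated sequence */
	}
	for (size_t i = 1; i < need; i++) {
		if ((buf[i] & 0xc0) != 0x80) {
			return 0; /* not a continuation byte */
		}
		value = (value << 6) | (buf[i] & 0x3f);
	}
	/* Reject overlong forms, surrogates and values beyond U+10FFFF. */
	if (value < min || value > 0x10ffff || (value >= 0xd800 && value <= 0xdfff)) {
		return 0;
	}
	*cp = value;
	return need;
}

/* Hypothetical escaper: printable ASCII is copied, other valid code points
 * become \U00hhhhhh, bytes that fail to decode become \xhh.
 * Returns the number of characters written, or 0 if out is too small. */
size_t sketch_escape(const uint8_t *buf, size_t len, char *out, size_t out_size) {
	assert(buf != NULL && out != NULL && out_size > 0);
	size_t i = 0, used = 0;
	out[0] = '\0';
	while (i < len) {
		uint32_t cp = 0;
		size_t n = sketch_utf8_decode(buf + i, len - i, &cp);
		size_t left = out_size - used;
		int w;
		if (n == 1 && cp >= 0x20 && cp < 0x7f) {
			w = snprintf(out + used, left, "%c", (int)cp);
		} else if (n > 0) {
			w = snprintf(out + used, left, "\\U%08x", (unsigned)cp);
		} else {
			w = snprintf(out + used, left, "\\x%02x", (unsigned)buf[i]);
		}
		if (w < 0 || (size_t)w >= left) {
			return 0; /* output buffer too small */
		}
		used += (size_t)w;
		i += n > 0 ? n : 1; /* skip one byte on an invalid decode */
	}
	return used;
}
```

With this sketch, the input bytes 41 e2 82 ac ff are escaped to A\U000020ac\xff: the euro sign decodes to a three-byte code point, while 0xff is not valid UTF-8 and falls back to the byte escape.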

TODO Overview

What happens here:
Slowly copying changes from https://github.com/Rot127/rizin/tree/rz-search-reference (which is #4742 with some comments already addressed).

Will resend to fuzz-dist when all the tests pass here.

Stuff to do (without any order)

Test plan

...

Closing issues

closes #4910

Setting this flag for

librz/arch/data.c

librz/util/ebcdic.c

librz/util/str_search.c

Most calls to process_one_string() never decode a valid string. Previously, it nonetheless allocated a complete buffer on the heap, even though the buffer was freed again after one iteration. This is now prevented by first decoding onto the stack and continuing on the heap only when the string is reasonably long.
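
The stack-first idea can be sketched roughly as follows. This is not the code in str_search.c: sketch_extract_string is a hypothetical stand-in that only collects printable ASCII, and the threshold SKETCH_STACK_CAP is an assumed value for "reasonably long".

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_STACK_CAP 64 /* assumed threshold, not taken from rizin */

/* Hypothetical extractor, not process_one_string(). Returns a heap-allocated,
 * NUL-terminated copy of the printable-ASCII run at the start of buf, or NULL
 * if the run is shorter than min_len. The caller frees the result. */
char *sketch_extract_string(const uint8_t *buf, size_t len, size_t min_len) {
	assert(buf != NULL);
	char stack_buf[SKETCH_STACK_CAP];
	char *out = stack_buf; /* decoding starts on the stack */
	char *heap_buf = NULL;
	size_t n = 0;
	for (size_t i = 0; i < len && isprint(buf[i]); i++) {
		if (!heap_buf && n == SKETCH_STACK_CAP) {
			/* The run is long enough to be worth a heap buffer. */
			heap_buf = malloc(len + 1);
			if (!heap_buf) {
				return NULL;
			}
			memcpy(heap_buf, stack_buf, n);
			out = heap_buf;
		}
		out[n++] = (char)buf[i];
	}
	if (n < min_len) {
		/* Common case: no valid string. free(NULL) is a no-op, so short
		 * rejected runs never touched the heap at all. */
		free(heap_buf);
		return NULL;
	}
	if (!heap_buf) {
		/* Short but valid string: allocate exactly once, at the end. */
		heap_buf = malloc(n + 1);
		if (!heap_buf) {
			return NULL;
		}
		memcpy(heap_buf, stack_buf, n);
	}
	heap_buf[n] = '\0';
	return heap_buf;
}
```

In this sketch a call that finds no string, which the comment above describes as the common case, returns NULL without any allocation. Only accepted strings, or runs longer than the stack buffer, reach malloc.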

Rot127 · 2025-02-20T15:17:55Z

Going to squash commits into reasonable parts and push to dist-fuzz.

notxvilka · 2025-02-20T15:19:19Z

> Going to squash commits into reasonable parts and push to dist-fuzz.

@Rot127 please open a new PR with that.

notxvilka

An immense amount of work and way better test coverage. While there are still things that could be improved, I believe they could go into separate PRs so they no longer block this one.
Thus, LGTM. Let's merge it once it is green and has a better history! Kudos!

notxvilka · 2025-02-20T15:19:59Z

.github/workflows/ci.yml

@@ -99,7 +99,7 @@ jobs:
            os: ubuntu-22.04
            build_system: meson
            compiler: gcc-12
-            cflags: "-DASAN=1 -DRZ_ASSERT_STDOUT=1 -ftrivial-auto-var-init=pattern -funsigned-char"


Open an issue about it also

librz/arch/p/analysis/analysis_arm_cs.c

librz/core/cconfig.c

notxvilka · 2025-02-20T15:26:50Z

librz/include/rz_util/rz_str.h

@@ -31,6 +33,7 @@ typedef enum {
 	RZ_STRING_ENC_EBCDIC_US = 's',
 	RZ_STRING_ENC_EBCDIC_ES = 't',
 	RZ_STRING_ENC_GUESS = 'g',
+	RZ_STRING_ENC_SETTINGS = 'S', ///< Use str.encoding.


@wargio you never answered this one. It's okay for now, but would be nice to separate.

notxvilka · 2025-02-20T15:28:40Z

librz/search/search_internal.h

+#include <rz_list.h>
+#include <rz_th.h>
+
+#define RZ_SEARCH_AES_LENGTH         40


Should be in a separate file, I think, but is fine for this PR, could be done afterwards.

Yes, it will be, once the AES search is refactored. @wargio already moved it into a separate file before.

keep this internal. no real need to move this, and since i want to refactor more that type of search with new features, i think it can be ok to be there for now.

librz/util/hex.c

librz/include/rz_util/rz_str.h

librz/include/rz_util/rz_unicode.h

Rot127 · 2025-02-20T17:13:45Z

Closing in favor of a PR with a cleaned-up history.

Superseded by #4919

github-actions bot added RzAnalysis rz-find API ESIL RzCore ARM MIPS PPC X86 RzUtil RZIL RzSearch labels Dec 10, 2024

Rot127 force-pushed the rz-search branch from c0abf35 to 10c86ed Compare December 10, 2024 20:09

github-actions bot added API RzSearch and removed RzAnalysis rz-find API ESIL ARM MIPS PPC X86 RzUtil RZIL RzSearch labels Dec 10, 2024

Rot127 force-pushed the rz-search branch 2 times, most recently from 0c558c2 to b59292d Compare December 12, 2024 19:41

github-actions bot added the RzUtil label Dec 12, 2024

Rot127 force-pushed the rz-search branch from b59292d to f05f421 Compare December 13, 2024 21:08

Add segfault test for issue 4910.

a685ba5

Rot127 force-pushed the rz-search branch from e13f54a to a685ba5 Compare February 18, 2025 13:24

Rot127 added 2 commits February 18, 2025 09:49

Obey user set string encoding when analyzing data.

ae6c0c9

Prevent misaligned passing of memory references.

e85efc6

github-actions bot added the infrastructure label Feb 18, 2025

Don't duplicate string for copy.

8925bde

Rot127 force-pushed the rz-search branch from be3291e to 8925bde Compare February 19, 2025 15:11

Fix false positives

d8107fe

Rot127 force-pushed the rz-search branch from 109d26c to d8107fe Compare February 19, 2025 15:39

Increase timeout for ASAN tests

32a493c

wargio reviewed Feb 20, 2025

librz/arch/data.c

wargio reviewed Feb 20, 2025

librz/util/ebcdic.c

Rot127 force-pushed the rz-search branch from 8b50f3d to 62588cb Compare February 20, 2025 13:37

wargio reviewed Feb 20, 2025

librz/util/str_search.c (outdated)

Rot127 added 2 commits February 20, 2025 08:50

Enforce UTF-8 for GO binaries.

acaef76

Rot127 force-pushed the rz-search branch from 62588cb to acaef76 Compare February 20, 2025 14:00

wargio approved these changes Feb 20, 2025

wargio requested a review from notxvilka February 20, 2025 14:11

notxvilka approved these changes Feb 20, 2025

wargio reviewed Feb 20, 2025

librz/include/rz_util/rz_str.h

wargio reviewed Feb 20, 2025

librz/include/rz_util/rz_unicode.h

Rot127 closed this Feb 20, 2025

Rot127 mentioned this pull request Feb 20, 2025

RzShell: refactor string, regex and byte search #4919

Merged

Rot127 deleted the rz-search branch February 22, 2025 11:42
